ASSESSING THE UNIDIMENSIONALITY OF TESTS

By 

John  Dixon 
May  1985 

Chairman:  James  Algina 

Major  Department:  Foundations  of  Education 

Much of the theory and practice of psychological measurement
has been based on the assumption that a single trait is being
measured. Frequently this assumption is tested by factor
analysis of a matrix of tetrachoric correlations among the
item scores. The purpose of this dissertation was to compare
two commonly used criteria for deciding upon the number of
factors underlying test performance: the number of eigenvalues
greater than one and a chi-square statistic.

The comparison was conducted using data created by a Monte
Carlo simulation based on three different models for how
people respond to tests of various characteristics. A
single-trait model was used throughout the simulation. The
data characteristics that were varied in the simulation were
the number of items, the number of subjects taking the test,
the distribution of the trait, and the variation in the
difficulty of the items within a test.

Neither criterion for assessment of dimensionality was
sensitive to the distribution of the trait. With both methods,
fewer items or more subjects improved the probability of
finding a single trait when only one existed. The prevalence
of guessing did not increase the likelihood of spurious trait
indications under the chi-square criterion, but did under the
eigenvalue criterion.

CHAPTER I
INTRODUCTION 

Overview 

The purpose of this dissertation was to investigate the
utility of factor analysis for assessing the dimensionality
of tests. This chapter includes

1. A definition of dimensionality.

2. Definitions of common models from item response theory.

3.  A  statement  of  the  research  questions. 

4.  A  brief  description  of  the  methodology  employed. 

Definition  of  Dimensionality 
It is typical to find that the scores on all pairs of items on
a test are statistically dependent. As a result the items can
be considered to measure one or more variables or traits in
common. Such traits are called latent traits since they cannot
be directly observed. Their existence is inferred from the
statistical dependence among scores. The problem of defining
the dimensionality of a test is that of defining what is meant
by the number of latent traits measured by the test. Since the
latent traits are the source of the statistical dependence
among items, one reasonable definition of the dimensionality
of a test is the number of latent traits required to account
for the statistical dependence among items. This is the
definition adopted in item response theory. Although the
dimensionality of a test has been defined in many ways,
recently there has been a convergence of opinion among
psychometricians that the definition adopted in item response
theory is correct (Lord, 1980; McDonald, 1981).

More precisely, the item response theory definition of
dimensionality is the number of latent traits required to
induce local independence between all pairs of items on the
test. Two items are locally independent if

$$P_{gh}(\theta) = P_g(\theta)\,P_h(\theta) \tag{1-1}$$

where $\theta$ is a K-dimensional random vector of latent
traits with K less than the number of items, $P_{gh}(\theta)$
is the probability that an examinee with score pattern
$\theta$ answers items g and h correctly, $P_g(\theta)$ is the
probability that an examinee with this score pattern answers
item g correctly, and $P_h(\theta)$ is defined similarly for
item h. The equation states that the scores on items g and h
are statistically independent for the subpopulation of
examinees with that particular score pattern on $\theta$.
Since the scores are dependent in the population of interest,
this local independence implies that the dependence among the
responses to items g and h is accounted for by the
K-dimensional latent vector $\theta$. Thus the test is
K-dimensional. If K = 1, the test is unidimensional, since a
single latent trait accounts for the statistical relationship
between members of each pair of items.
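
To see why local independence is compatible with dependence in
the whole population, it may help to write out the marginal
probabilities for the unidimensional case. The following is
only a sketch; the trait density $f(\theta)$ and the marginal
symbols $P_{gh}$ and $P_g$ are notation assumed here for
illustration and do not appear in the original text.

```latex
% Marginal (population) probabilities, integrating over the
% assumed trait density f(\theta):
\begin{align*}
P_{gh} &= \int P_g(\theta)\,P_h(\theta)\,f(\theta)\,d\theta, &
P_g &= \int P_g(\theta)\,f(\theta)\,d\theta, \\
P_{gh} - P_g P_h &= \operatorname{Cov}\bigl(P_g(\theta),\,P_h(\theta)\bigr).
\end{align*}
```

When both response functions increase with $\theta$ and
$\theta$ varies across examinees, this covariance is positive,
so the two items are dependent in the population even though
they are independent at each fixed value of $\theta$.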

Item  Response  Theory  Models 
Several models for the relationship between $\theta$ and
$P_g(\theta)$ have been proposed for use when the test is
unidimensional. One of these models is the normal ogive model

$$P_g(\theta) = \int_{-\infty}^{A_g(\theta - B_g)} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \tag{1-2}$$

In this model $\theta$ is the unidimensional latent trait,
$A_g$ is an item discrimination parameter for the gth item,
$B_g$ is an item difficulty parameter for the gth item, and
the integrated function is a standard normal density function.
The model states that the relationship between $P_g(\theta)$
and $\theta$ has the form of a cumulative normal ogive.
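
As a minimal sketch of equation (1-2), the normal ogive
probability can be computed from the error function. The
function name `normal_ogive` and its argument names are
hypothetical and chosen only for this illustration.

```python
import math


def normal_ogive(theta, a, b):
    """Probability of a correct response under the normal ogive model.

    theta: latent trait value; a: item discrimination (A_g);
    b: item difficulty (B_g). Returns Phi(a * (theta - b)), where Phi
    is the standard normal cumulative distribution function.
    """
    z = a * (theta - b)
    # Phi(z) expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

An examinee whose trait value equals the item difficulty has a
probability of 0.5 of answering correctly, whatever the
discrimination.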

In addition to the normal ogive model there are several
logistic models commonly used in item response theory. The
logistic models are differentiated by the number of item
parameters included in a particular model. The one-parameter
model is

$$P_g(\theta) = \frac{e^{(\theta - B_g)}}{1 + e^{(\theta - B_g)}} \tag{1-3}$$

where the probability of responding correctly is dependent on
the latent score $\theta$ and the item difficulty parameter
$B_g$.
The two-parameter logistic model

$$P_g(\theta) = \frac{e^{A_g(\theta - B_g)}}{1 + e^{A_g(\theta - B_g)}} \tag{1-4}$$

adds an item discrimination parameter $A_g$ to the model. In
both qualitative and quantitative terms this model is similar
to the normal ogive model. The three-parameter logistic model

$$P_g(\theta) = C_g + (1 - C_g)\,\frac{e^{A_g(\theta - B_g)}}{1 + e^{A_g(\theta - B_g)}} \tag{1-5}$$

includes a guessing term $C_g$ in the model.
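
The three logistic models are nested, which a short sketch can
make explicit: setting the guessing term to zero gives the
two-parameter model, and additionally setting the
discrimination to one gives the one-parameter model. The
function name `logistic_irf` and its defaults are hypothetical
and used only for illustration.

```python
import math


def logistic_irf(theta, a=1.0, b=0.0, c=0.0):
    """Item response function for the logistic models (1-3) to (1-5).

    a: discrimination (A_g), b: difficulty (B_g), c: guessing (C_g).
    With c = 0 this is the two-parameter model; with c = 0 and a = 1
    it is the one-parameter model.
    """
    z = a * (theta - b)
    # e^z / (1 + e^z) rewritten as 1 / (1 + e^(-z)).
    return c + (1.0 - c) / (1.0 + math.exp(-z))
```

For example, `logistic_irf(0.0, a=1.5, b=0.0, c=0.2)` evaluates
the three-parameter model at the item's difficulty, where the
probability lies halfway between the guessing level and one.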

Research  Questions 
Lord and Novick (1968) pointed out a relationship between the
normal ogive item response model and the common factor
analysis of binary items. If the relationship between
$P_g(\theta)$ and $\theta$ is described by the normal ogive
model and if $\theta$ is normally distributed in the group
tested, then the matrix of tetrachoric intercorrelations among
the items will be of unit rank after appropriate communalities
are inserted in the principal diagonal. Thus, under the stated
conditions, a common factor analysis model with one factor
will fit the matrix of tetrachoric correlations. Violations of
these conditions may produce more than one common factor.
Nevertheless it may be that factor analysis of the tetrachoric
correlations can provide a useful guide to the dimensionality
of the test even when the stated conditions are not fully met.
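
The unit-rank result can be made concrete with the usual
underlying-variable formulation of the normal ogive model. The
sketch below assumes a standardized trait and introduces
loadings $\lambda_g$, thresholds $\tau_g$, and latent response
variables $Y_g$; these symbols are assumptions made for
illustration and are not used in the original text.

```latex
\begin{align*}
Y_g &= \lambda_g\,\theta + \sqrt{1-\lambda_g^2}\;\varepsilon_g,
  \qquad \theta,\ \varepsilon_g \sim N(0,1)\ \text{independent},
  \qquad \text{item } g \text{ correct iff } Y_g > \tau_g, \\
P_g(\theta) &= \Phi\!\left(\frac{\lambda_g\theta - \tau_g}{\sqrt{1-\lambda_g^2}}\right)
  = \Phi\bigl(A_g(\theta - B_g)\bigr),
  \qquad A_g = \frac{\lambda_g}{\sqrt{1-\lambda_g^2}},
  \quad B_g = \frac{\tau_g}{\lambda_g}, \\
\rho_{gh} &= \operatorname{Corr}(Y_g, Y_h) = \lambda_g\lambda_h \quad (g \neq h),
  \qquad \lambda_g = \frac{A_g}{\sqrt{1+A_g^2}}.
\end{align*}
```

Here $\rho_{gh}$ is the tetrachoric correlation between items
g and h. With the communalities $\lambda_g^2$ placed on the
diagonal, the matrix equals the outer product of the loading
vector with itself, which has rank one.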

While there are many approaches to determining the number of
factors, two were investigated in this study. The first was
the Guttman-Kaiser criterion of the number of eigenvalues
greater than one, with the eigenvalues calculated from the
matrix of tetrachoric correlations. The second was the
chi-square goodness of fit test with regard to the number of
factors. This test is based on the generalized least squares
approach to factor analysis (Jöreskog and Goldberger, 1972).
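
The eigenvalue criterion can be sketched in a few lines. The
example below is an illustration only, not the software used
in the study: it assumes the correlation matrix is already
available as a list of lists, computes eigenvalues with a
simple cyclic Jacobi routine, and counts those greater than
one. The function names are hypothetical, and the chi-square
test is not shown.

```python
import math


def symmetric_eigenvalues(matrix, tol=1e-20, max_sweeps=100):
    """Eigenvalues of a real symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        # Sum of squared off-diagonal elements; stop once negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle (at most pi/4 in size) that zeroes a[p][q].
                if a[p][p] == a[q][q]:
                    angle = math.pi / 4.0
                else:
                    angle = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(angle), math.sin(angle)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def guttman_kaiser_count(corr):
    """Number of eigenvalues of a correlation matrix that exceed one."""
    return sum(1 for ev in symmetric_eigenvalues(corr) if ev > 1.0)
```

For a five-item matrix in which every off-diagonal correlation
is 0.49 (a single factor with loadings of 0.7), the eigenvalues
are 2.96 and four copies of 0.51, so the criterion indicates
one factor.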

The    relationship  between    factor    analysis  and  item 
response  theory  suggests  the  following  research  questions: 

1. Under conditions in which factor analysis should
indicate unidimensionality (that is, when the normal
ogive model fits the data and the latent trait is
normally distributed), does each criterion indicate a
single factor?

2. How well does each factor analytic criterion
indicate unidimensionality when the normal ogive model
fits the data, but the latent trait is not normally
distributed?

3. How well do the factor analytic criteria indicate
unidimensionality when a two- or three-parameter
logistic model fits the data, and the latent trait is
normally distributed?

4. How well do the factor analytic criteria indicate
unidimensionality when a two- or three-parameter model
fits the data and the trait is non-normally
distributed?

In addition it seemed important to investigate the impact of
test length, number of examinees, and difficulty range on the
effectiveness of the two criteria in indicating
unidimensionality.

Method  of  Investigation 

To investigate the utility of factor analytic criteria for
determining unidimensionality, there are two possible
approaches. One approach would use real test data sets,
whereas the second would use simulated test data. Using real
data would require a method with acknowledged validity for
determining the dimensionality of the tests. The results of
applying this method could be compared to the results of the
factor analytic methods. Unfortunately there is no method with
broadly acknowledged validity for determining dimensionality.
The prime advantage of using simulated data is that the
dimensionality of the simulated data is known. Generation of
simulated data also provides better control over the data
characteristics, provides far more data, and allows an
investigation of a wider variety of characteristics than could
be achieved through collection of actual examinee data. The
main problem with simulated data is that their characteristics
may not adequately represent the characteristics of real data.

Rationale 

The unidimensionality of a test has been of concern to those
involved in testing for two major reasons. First, the
interpretation of a test is clearest when that test is
unidimensional. If a test measures more than one
characteristic, it is unknown to what extent each of the
characteristics contributes to the score. Secondly, many of
the test construction methods assume that the items measure a
single trait. This assumption is common to all the
unidimensional latent trait theories, such as the Rasch model
and the three-parameter logistic model. A variety of methods
have been suggested for checking on the unidimensionality of a
set of items, but no single method is universally accepted as
superior.

The purpose of this study was to compare two factor analytic
criteria for assessing dimensionality. It is hoped that
findings from this study will provide researchers and
practitioners with some direction in choosing between these
criteria and will indicate when caution should be exercised in
relying on them because assumptions of the procedures are
violated. The degree of allowable violation suggested by the
results of the present study should be of considerable help to
practitioners who must decide whether some test construction
methods, such as the logistic models, are appropriate for
their data.


CHAPTER  II 
REVIEW 

A wide variety of methods have been used to assess
unidimensionality. Commonly used assessment methods can be
classed into three categories: reliability indices, item
response models, and factor analysis.

Green, Lissitz, and Mulaik (1977) investigated the use of
internal consistency indices to test for unidimensionality.
They generated several sets of data, varying the number of
common factors, the number of factors relating to a single
item, the communalities, and the number of repetitions of the
items in the model. To compare the different samples, they
computed the average intercorrelation, coefficient alpha, and
their own index of unidimensionality. They found that alpha
increased as the number of items increased, especially when
there were parallel items. Alpha increased as the number of
factors pertaining to each item increased. Alpha decreased
when communalities decreased. Their index was relatively
independent of the number of items and the communality of the
items. It was also less sensitive to the number of factors
pertaining to each item than was coefficient alpha. However,
Hattie (1981) examined the index generated by Green et al. as
well as a number of other related indices, such as coefficient
alpha. He found that all of the indices studied were poor
indices of unidimensionality. In his simulation the index
proposed by Green et al. often exceeded its theoretical
maximum.

Bejar (1980) suggested a method for investigating the
unidimensionality of tests that is based on item response
theory. He described two related methods which can be used if
the items can be divided into suspected subsets based on
content or some other item characteristics. In the first
procedure, the item difficulty parameters from the two sets
(the subset and the total test) are tested for equivalence.
The second procedure compares the content-based correlations
and the total-test-based correlations between the parameter
estimates. These two methods have the drawback of requiring a
priori division of the items into subsets. An examination of a
number of subsets to search for lack of homogeneity could be
expensive, and could not ensure that the subsets examined were
those which would produce multidimensionality.

Hattie (1984) conducted a simulation study examining a number
of coefficients and unidimensionality assessment techniques
for a variety of data characteristics. He found that fitting a
two-parameter model and examining the residuals proved the
most effective. All other methods were found ineffective. The
study was limited in terms of the number of items and the
number of examinees used. This limits the interpretation so
severely that the study could only point towards a probable
good method, and could not be taken as eliminating many of the
other methods. Because Hattie (1984) found that the most
commonly used procedure, linear factor analysis, overestimated
the number of factors in the unidimensional case,
investigation of the limitations of the factor analytic method
for the unidimensional case was the focus of this
dissertation.

Factor analysis has also been suggested as a method for
assessing the unidimensionality of a test by Lord (1980) and
McDonald (1984). When test items are dichotomously scored, the
factor analysis is typically done on either phi or tetrachoric
correlation coefficients. Phi coefficients vary with the range
of difficulty of the items, which may produce spurious factors
(Wherry and Gaylord, 1944). A disadvantage of tetrachorics is
that they are difficult to estimate and may not provide a
correlation matrix that is positive definite. A desirable
feature of the tetrachoric correlations is that if they
produce only one factor, then that is a sufficient condition
for unidimensionality (Lord and Novick, 1968). Hambleton et
al. (1978), in a recent review of latent trait theory,
suggested that factor analysis is the most promising of the
dimensionality assessment methods. Thus the factor analytic
method was the focus of the present study.

Simulation  Studies  Based  on  Different  Data  Models 
Simulation studies of characteristics of tests have typically
used one or two models for generating data. For example,
Reckase (1979) used both a one-parameter and a three-parameter
latent trait model to simulate item responses. The
one-parameter model included terms for a person's ability, the
difficulty of the item, and the subject's response. The
three-parameter model added guessing and discrimination
parameters. Cudeck (1980) used a three-parameter model similar
to the one used by Reckase. He used several levels of each
parameter, then performed analysis of variance on the results.
This was a study of internal consistency indices.

In the present investigation three models for data generation
were used. These were the normal ogive, the two-parameter
logistic, and the three-parameter logistic models.

The normal ogive simulation model used in this study was
patterned after that used by Cotton, Campbell, and Malone
(1957), and more recently by Reckase (1979). The procedure
takes the desired factor loading matrix and the desired
traditional item difficulty parameters for one item. Z-scores
are produced for each person on each item based on a generated
random normal trait, weighted by the factor model. The
resulting items are then dichotomized based on the traditional
item difficulty parameters.
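
The procedure just described can be sketched for the
single-factor case. This is an illustration under stated
assumptions rather than the program used in the study: it
assumes one loading per item, an independent standard normal
unique part for each response, and that the traditional item
difficulty is the intended proportion correct. The function
and argument names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_normal_ogive(n_subjects, loadings, p_values, seed=0):
    """Simulate dichotomous item scores from a one-factor model.

    loadings: assumed factor loading of each item on the single trait.
    p_values: assumed target proportion correct for each item.
    Returns a list of rows, one per subject, of 0/1 item scores.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Threshold tau such that P(z > tau) equals the target proportion.
    thresholds = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    data = []
    for _ in range(n_subjects):
        theta = rng.gauss(0.0, 1.0)  # random normal trait
        row = []
        for lam, tau in zip(loadings, thresholds):
            # Z-score: trait weighted by the loading plus a unique part.
            z = lam * theta + math.sqrt(1.0 - lam * lam) * rng.gauss(0.0, 1.0)
            row.append(1 if z > tau else 0)
        data.append(row)
    return data
```

A call such as `simulate_normal_ogive(500, [0.7] * 10, [0.5] * 10)`
would produce one 500-subject, 10-item data set with all items
of middle difficulty.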

The  two  and     three  parameter  data  simulation    models  are 
similar.       The    method     used     in     DATGEN     (Hambleton  and 
Rovinelli,   1973)  typifies  these  methods  of  data  generation. 
Parameters  for  each  item  are  generated.       The  distributions 
of     the  item  parameters  were     either     normally  distributed, 
(e.g.  Yen,   1981)  or  normally  distributed   (e.g.  Rudner,  Getson 
and  Knight,   1980).       To  get  the  ability  scores,     a  Z-score 
is  generated  for    each  person.     This  score  is  transformed  to 
fit    the    ability  range,     then    dichotomized    based    on  the 
logistic  model  being  used. 
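
A minimal sketch of logistic data generation follows. It is
not the DATGEN algorithm; it assumes a standard normal ability
with no rescaling, and it assumes that each response is scored
correct when a uniform random number falls below the model
probability. The function and argument names are hypothetical.

```python
import math
import random


def simulate_3pl(n_subjects, a, b, c, seed=0):
    """Simulate 0/1 responses under the three-parameter logistic model.

    a, b, c: per-item discrimination, difficulty, and guessing values.
    Setting every c to 0 gives two-parameter data.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        theta = rng.gauss(0.0, 1.0)  # ability Z-score for this person
        row = []
        for a_g, b_g, c_g in zip(a, b, c):
            p = c_g + (1.0 - c_g) / (1.0 + math.exp(-a_g * (theta - b_g)))
            # Assumed scoring rule: correct when a uniform draw is below p.
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data
```

Under this scoring rule the responses are locally independent
given the ability value, because each item uses its own random
draw.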

Data  Characteristics  That  Affect  Tests  of  Dimensionality 
Item difficulty distribution has long been recognized as a
factor that may affect the apparent dimensionality of test
scores. Guilford (1941) investigated the possibility that a
"difficulty factor" was being extracted from an analysis of a
test. He interpreted the larger than expected number of
factors from a unidimensional test to be due to different
abilities being tested at different difficulty levels rather
than the effect being due to an artifact of the factor model.
Ferguson (1941) demonstrated that matrices of item
inter-correlations which are heterogeneous with respect to
difficulty will have greater rank, and so will demonstrate
greater factorial complexity. He suggested that tests of
homogeneous difficulty will yield factors which will be more
interpretable in terms of content. Subtests may be useful for
this type of analysis if items are grouped on the basis of
difficulty.

Another data characteristic that may affect the outcome of
dimensionality studies is item discrimination. Reckase (1979)
investigated one- and three-parameter logistic models under
different conditions of dimensionality. He used five real data
sets and five generated data sets. Both groups of data varied
in dimensionality and size of factor loading on defining
items. He performed classical item analysis, logistic item
analysis, and factor analysis on each set of data. He
demonstrated that the one-parameter model estimates the sum of
the factors while the three-parameter model related most
highly to a particular factor, but not necessarily the first.
He also demonstrated that the average discrimination parameter
in the three-parameter model was strongly related to the size
of the first eigenvalue even when the data were
multidimensional. The standard deviations of the
discrimination indices fell as the eigenvalues increased,
indicating an inverse relationship between the dimensionality
of the data and the stability of the discrimination estimates.

Two other key variables which have been found to influence
outcomes of factor analytic studies are the number of items
and the number of subjects. Two articles differed on the
number of items to include in a factor analytic investigation
of dimensionality. Linn (1968) simulated a variety of
characteristics of data and compared different methods of
determining the number of factors. He found that the common
criterion of retaining factors with eigenvalues greater than
unity inflated the number of factors, depending on the degree
of sampling error. When the proportion of factors to items was
small, the results tended to be more stable, and the number of
factors was more accurate. Higher communalities produced more
stable results and better estimates of the correct number of
factors. He also found that the larger the sample size, the
better the results. However, Mulaik and McDonald (1978) noted
many drawbacks to factor models from a domain sampling
standpoint. A larger set of items will not necessarily better
determine the factor space, and not even the use of reference
variables assures stable factor solutions. They recommended
more items and reference variables from a practical
standpoint. They also argued that "a behavioral domain defined
in terms of clearly stated measurement operations should
reduce considerably the potential for alternative solutions
and interpretations to turn up in factor analysis of items
from such domains" (p. 191).

The "roots greater than or equal to one" criterion used in the
present study has been investigated in several studies in
which sample size and number of variables were manipulated.
Cattell and Jaspers (1967) used a known number of factors and
generated correlation matrices, varying the number of
variables and number of subjects. The criterion underestimated
the number of factors in six of the matrices and estimated it
correctly in two. Browne (1968) found similar effects in small
samples, but for larger samples found that accuracy was
greatly improved. Browne compared maximum likelihood,
Thomson's method (iterated principal components), principal
components (axis), weighted principal components (axis), and
centroid factor analysis methods. He generated sample
correlation matrices, varying the sample size and number of
variables, keeping the number of factors constant at four. The
communalities covered a wide range of values. He evaluated the
accuracy of the methods by comparing them to the population
matrices from which the sample correlation matrices were
generated. The maximum likelihood procedure was slightly
superior in estimating the factor loadings. The effect was
small for small samples when compared with the two principal
factor methods. No method for estimating the number of factors
proved completely satisfactory. The sequential testing of the
number of factors using the likelihood ratio test and the
"roots greater than one" criterion gave the best results.

Tucker, Koopman, and Linn (1969) simulated correlation
matrices. They based their model on domains of factors: major,
minor, and unique factors. They varied the ranges of the
coefficients, the size of the major domain (number of
factors), and the relationship between the types of factors,
i.e., the major, minor, and unique factors. Guttman's test for
the lower bounds became less effective where the minor and
unique factors were included in the model. The "roots greater
than one" criterion on principal components was better than
Guttman's method. The scree test was most useful in the major
factor model. They found a better fit to the original factors
when they extracted additional factors in the models which
included minor and unique factors. They found communality
estimation generally improved the fit. The general findings
seem to favor the "roots greater than one" criterion as a
statistical method, but showed some improvement using
heuristic approaches.

In considering properties of test scores which may impact the
assessment of unidimensionality, many factors have been
theorized to be potential problems. Some problems lie in the
methods of assessing the unidimensionality of the test, such
as whether it is a linear or a non-linear model. Other
problems may lie in the way the items are inter-related, such
as through tetrachoric correlations; these may not produce
positive-definite matrices and are difficult to estimate in
the extremes of their range.

Given the above difficulties, it may still be possible to use
or interpret some common methods for assessing
unidimensionality if several characteristics of the data are
suitable. A smaller number of items in a test may improve the
assessment of unidimensionality if the test is unidimensional
(Lord and Novick, 1968). A larger number of subjects could
provide for more stable factors, again benefiting the
assessment of unidimensionality. The distribution of the
latent trait has been hypothesized by Lord (1980) to be
important, with departures from normality producing spurious
factors. The variability of item difficulty may contribute to
difficulty factors (McDonald, 1981). Since factor loadings
should be related to the discrimination levels of the items,
variability in discrimination may produce multiple factors.
Guessing should produce multiple factors (Lord, 1980), since
it is differentially related to more difficult items.

In  conclusion,  based  on  these  findings  reported  in  the 
literature,   the  data  characteristics  of  sample  size,  range 
of  item  difficulty,   length  of  test,  distribution  of  the 
latent  trait,   and  the  data  generating  model  were  varied 
systematically  in  the  present  study. 


CHAPTER  III 
METHODS 


Design 

In order to investigate the questions presented in Chapter I,
a simulation approach was used. The conditions that were
systematically varied were a) sample size, b) number of items,
c) the distribution of the latent trait, d) the range of item
difficulty, and e) the model for generating the data. These
conditions were combined to provide various cells of the
design. For each cell of the design, the data were replicated
20 times. Details of the levels of these conditions and the
method of generating the data are described in the following
sections.

Sample  Size 

The number of subjects varied over three levels: 100, 500, and
1000 subjects. The 100 subject samples were seen as
representative of a very small study aimed at assessing
unidimensionality. Smaller sample sizes are rare in studies of
this type, and latent trait parameter estimates tend to be
unstable with samples smaller than 300 (Hattie, 1984). The 500
subject samples were chosen as typical of moderate sized
investigations of unidimensionality and conform to the sample
size chosen in a similar simulation by Hattie (1984). The 1000
subject samples were used to represent larger studies of
unidimensionality. This larger sample size was typical of
simulations in the area (Reckase, 1979; Yen, 1981; Rudner,
Getson, and Knight, 1980). Larger sample sizes would have been
desirable, but were not used due to limited computer
resources.

Number  of  Items 
The  number  of  items  varied  over  three  levels:   10,   20,  and 
30  items.       The     10  item  tests  were     intended  to  represent 
unidimensional  subscales  of  tests,  or  very  short  tests.  The 
20  item  tests  were  seen  as  typical  of  short  tests,     and  the 
30  item  tests     were  seen  as  average     length  unidimensional 
tests.

Distribution  of  the  Latent  Trait 
The  distribution  of  the  latent     trait  varied  in  terms  of 
skewness  and  kurtosis.       Ten    different  distributions  were 
used,     including    a  normal  distribution  with     skewness  and 
kurtosis  of  0.0.     Table  1  presents  the  values  of  skewness  and 
kurtosis which were used. The skewness, which describes a
distribution's deviation from symmetry, ranged from -0.25
to 1.75. Kurtosis values ranged from -1.00 to 3.75. The
method  for  producing  distributions  with  the  desired  skewness 
and     kurtosis  was       developed     by  Fleishman     (1978).  The 
appendix  presents  histograms  of  each  of  the  10  distributions. 
Each     histogram  is  based  on  10,000     observations     from  the 
distribution. The variety of distributions presented in these
figures suggests that the distributions include the kinds of
latent trait distributions likely to be found in practice.

TABLE 1

Skewness and kurtosis of parameter values

Skewness      Kurtosis

  1.75          3.75
  1.25          1.50
  1.00          3.75
  1.00          0.50
  0.75          0.00
  0.00          3.75
  0.00          2.00
  0.00          0.00
  0.00         -1.00
 -0.25          3.75
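The transformation credited to Fleishman (1978) can be sketched as
follows. This is a minimal illustration, not the program used in the
study: it assumes the usual statement of the power method, in which a
standard normal score Z is mapped to a + bZ + cZ^2 + dZ^3 with a = -c,
and the coefficients solve three moment equations for unit variance,
the target skewness, and the target kurtosis (with kurtosis scaled so
that the normal distribution has the value 0.0, as in Table 1). The
function names and the simple Newton solver are hypothetical.

```python
import random

def fleishman_residuals(b, c, d, skew, kurt):
    """Moment equations of the power method; all three are zero at a solution."""
    f1 = b * b + 6 * b * d + 2 * c * c + 15 * d * d - 1
    f2 = 2 * c * (b * b + 24 * b * d + 105 * d * d + 2) - skew
    f3 = 24 * (b * d + c * c * (1 + b * b + 28 * b * d)
               + d * d * (12 + 48 * b * d + 141 * c * c + 225 * d * d)) - kurt
    return [f1, f2, f3]

def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, 3):
            factor = m[i][col] / m[col][col]
            for j in range(col, 4):
                m[i][j] -= factor * m[col][j]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / m[i][i]
    return x

def fleishman_coefficients(skew, kurt, tol=1e-10, max_iter=100):
    """Newton iteration (finite-difference Jacobian) started at the normal case."""
    x = [1.0, 0.0, 0.0]          # b, c, d for an untransformed normal score
    h = 1e-7
    for _ in range(max_iter):
        f = fleishman_residuals(x[0], x[1], x[2], skew, kurt)
        if max(abs(v) for v in f) < tol:
            break
        jac = [[0.0] * 3 for _ in range(3)]
        for j in range(3):
            xp = x[:]
            xp[j] += h
            fp = fleishman_residuals(xp[0], xp[1], xp[2], skew, kurt)
            for i in range(3):
                jac[i][j] = (fp[i] - f[i]) / h
        step = solve3(jac, [-v for v in f])
        x = [xi + si for xi, si in zip(x, step)]
    b, c, d = x
    return (-c, b, c, d)

def fleishman_transform(z, coeffs):
    """Apply the cubic polynomial to one standard normal score."""
    a, b, c, d = coeffs
    return a + b * z + c * z ** 2 + d * z ** 3

# Example: one of the Table 1 combinations (skewness 0.75, kurtosis 0.00).
rng = random.Random(1985)        # arbitrary seed, for reproducibility only
coeffs = fleishman_coefficients(0.75, 0.0)
trait_scores = [fleishman_transform(rng.gauss(0.0, 1.0), coeffs)
                for _ in range(1000)]
```

Not every skewness and kurtosis pair has a solution under this method,
so a production version would need to check that the iteration
converged before using the coefficients.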

Item  Difficulty 
Two ranges of item difficulty were used, -2 to 2 and -1
to 1, as defined in the normal ogive and logistic models.
For each range item difficulty parameters were sampled
from a uniform distribution. Hattie (1984) used these
same ranges for difficulty in his simulation.

Item Discrimination
The item discrimination parameter was sampled from a
uniform distribution over the range .36 to 1.788.
With the normal ogive model and a normally distributed
latent trait this range corresponds to point biserial
correlations of .3 to .8 between item and latent trait scores.
This was similar to the range used in other similar studies and
the values found in actual tests as reported by Hattie
(1984).

Item  Guessing 

For  the     three  parameter  logistic  model,       the  guessing 
parameter  was  sampled    from  a  uniform  distribution    on  the 
interval  0.0  to  0.3,   inclusively.     This  covers  the  range  used 
by  other  studies  as  reported  by  Hattie  (1984). 

Data  Generation  Models 

A different equation was used to generate data for each of
the three models.

For  the  normal     ogive  model,     the  method     described  by 
Reckase  (1979)  was  used.       The  following  steps  describe  the 
process  for  generating  the  scores  on  a  given  item. 

1.  Generate  item  discrimination  parameters  within  the 
ranges  described  above. 

2.  Generate     item  difficulty  parameters     within  the 
ranges  described  above. 

3. For each simulated examinee generate an observation
from a standard normal distribution. Transform this
score using the method presented by Fleishman (1978).
This step creates the latent trait scores with the
appropriate skewness and kurtosis.

4. For each examinee generate a second observation
from a standard normal distribution.

5. Combine the transformed score ($\theta$) and the standard
normal score ($E_g$) using the equation

$$Y_g = \rho_g \theta + \sqrt{1 - \rho_g^2}\, E_g \tag{3-1}$$

Here

$$\rho_g = \frac{A_g}{\sqrt{1 + A_g^2}}$$

where $A_g$ is the item discrimination parameter.

6. After the $Y_g$ values have been generated for all
simulees, dichotomize the distribution at the $P_g$th
percentile. Here $P_g$, the proportion of examinees
answering correctly, is obtained using the relationship
between $P_g$ and $A_g$ and $B_g$ (Lord and Novick, 1968,
Equation 16.9.3):

$$P_g = \Phi\!\left(\frac{-A_g B_g}{\sqrt{1 + A_g^2}}\right) \tag{3-2}$$

where $\Phi$ is the cumulative normal distribution.
These dichotomized scores are the scores on the gth item for
all examinees.
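The steps above can be sketched in a few lines of code. This is only
an illustration under stated assumptions, not the program used in the
study: the function name is hypothetical, the latent trait is drawn
from a normal distribution (the Fleishman transformation of step 3 is
omitted), and step 6 is read as assigning a score of one to the
simulees whose Y values fall in the highest P_g share of the sample,
so that the observed proportion correct equals P_g.

```python
import math
import random
from statistics import NormalDist

def normal_ogive_item(thetas, a, b, rng):
    """Return 0/1 scores on one item for all simulees (steps 4 to 6)."""
    rho = a / math.sqrt(1.0 + a * a)          # correlation of Y with theta
    # Equation 3-2: proportion of examinees expected to answer correctly.
    p_g = NormalDist().cdf(-a * b / math.sqrt(1.0 + a * a))
    # Equation 3-1: continuous response variable for each simulee.
    noise_sd = math.sqrt(1.0 - rho * rho)
    y = [rho * t + noise_sd * rng.gauss(0.0, 1.0) for t in thetas]
    # Assumed reading of step 6: the highest p_g share of Y values score 1.
    n_correct = round(p_g * len(y))
    ranked = sorted(range(len(y)), key=lambda i: y[i], reverse=True)
    scores = [0] * len(y)
    for i in ranked[:n_correct]:
        scores[i] = 1
    return scores

rng = random.Random(1985)                     # arbitrary seed
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]   # normal trait only
a_g = rng.uniform(0.36, 1.788)                # discrimination range in the text
b_g = rng.uniform(-2.0, 2.0)                  # wider difficulty range
item_scores = normal_ogive_item(thetas, a_g, b_g, rng)
```

Repeating the call for each item, with fresh discrimination and
difficulty parameters, would give one simulated examinee-by-item score
matrix.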

The     two  and     three    parameter     logistic  models  were 
implemented  by     adapting  the  program  DATGEN     (Hambleton  and 
Rovinelli,     1973).       The  steps     involved  in  the  simulation 
were  as  follows: 

1.       Generate  difficulty  and  discrimination  parameters 
as  described     above.       Additionally,   generate  the 
guessing parameters for the three parameter model.

2.  Generate  the  latent  trait     score  for  a  simulated 
examinee  as  described  in  step  3  above. 

3. Using the appropriate logistic model, calculate
$P_g(\theta)$.

4. Generate a score from a uniform distribution on the
closed interval zero to one. If the score is less than or
equal to $P_g(\theta)$ then the examinee receives a score
of one on the item and receives a score of zero
otherwise.
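A minimal sketch of the logistic procedure is given below. It is not
the DATGEN program. The function names are hypothetical, and the
scaling constant of 1.7 in the exponent is an assumption (it is a
common convention for aligning the logistic curve with the normal
ogive, but the text does not state whether it was used). An item is
scored correct when the uniform draw does not exceed the model
probability, so that the chance of a correct response equals that
probability.

```python
import math
import random

def p_correct(theta, a, b, c=0.0, scale=1.7):
    """Three parameter logistic curve; c = 0 gives the two parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

def simulate_item_scores(thetas, a, b, c, rng):
    """Score one item for every simulee by comparing a uniform draw to P."""
    scores = []
    for theta in thetas:
        u = rng.random()                      # uniform draw on [0, 1)
        scores.append(1 if u <= p_correct(theta, a, b, c) else 0)
    return scores

rng = random.Random(1985)                     # arbitrary seed
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]   # normal trait only
a_g = rng.uniform(0.36, 1.788)                # discrimination range in the text
b_g = rng.uniform(-1.0, 1.0)                  # narrower difficulty range
c_g = rng.uniform(0.0, 0.3)                   # guessing range in the text
scores_3pl = simulate_item_scores(thetas, a_g, b_g, c_g, rng)
scores_2pl = simulate_item_scores(thetas, a_g, b_g, 0.0, rng)
```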

Tetrachoric  Correlations 
Before  calculating  the  matrix  of  tetrachoric  coefficients, 
all  items  that  were  answered     correctly  by  everyone  or  that 
were  missed     by  everyone  were     removed.       This     is  common 
practice,     since  the  correlation    coefficient  is  not  defined 
for  that  type  of  data.        The  tetrachoric  correlations  were 
calculated using the subroutine BECTR from the International
Mathematical and Statistical Libraries (IMSL, 1982).
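The screening step and the idea of a tetrachoric coefficient can be
illustrated as follows. This sketch does not reproduce the BECTR
algorithm. It substitutes the classical cosine-pi approximation to the
tetrachoric correlation, which is computed from the four cell counts
of a 2 x 2 table, and the function names are hypothetical.

```python
import math

def drop_constant_items(responses):
    """Remove items answered correctly by everyone or missed by everyone.

    responses is a list of rows (examinees), each a list of 0/1 item scores.
    Returns the reduced matrix and the indices of the items that were kept.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    keep = [j for j in range(n_items)
            if 0 < sum(row[j] for row in responses) < n_examinees]
    return [[row[j] for j in keep] for row in responses], keep

def tetrachoric_cos_pi(x, y):
    """Cosine-pi approximation to the tetrachoric correlation of two items."""
    both = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    neither = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    only_x = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    only_y = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    agree, disagree = both * neither, only_x * only_y
    if agree == 0 and disagree == 0:
        raise ValueError("coefficient undefined: an item has no variance")
    if disagree == 0:
        return 1.0
    if agree == 0:
        return -1.0
    return math.cos(math.pi / (1.0 + math.sqrt(agree / disagree)))

# Small example: 40 pass both, 10 pass only x, 10 pass only y, 40 fail both.
x = [1] * 40 + [1] * 10 + [0] * 10 + [0] * 40
y = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
r_xy = tetrachoric_cos_pi(x, y)
```

The approximation is convenient for illustration but can differ from
an iterative maximum likelihood estimate, particularly when the item
proportions correct are far from one half.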

Factor  Analysis 
A     factor  analysis  using  generalized  least  squares  was 
performed  to     obtain  the     chi-squared  test     of  model  fit 
(Joreskog  and  Goldberger,     1972).       The  subroutine  used  for 
the calculations was OFCOMM from IMSL. The same program
library  was     used  to     calculate  the     eigenvalues  of  the 
correlation  matrix. 
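For reference, the degrees of freedom usually associated with the
chi-square test of an m factor model for p variables are
((p - m)^2 - (p + m)) / 2. The short sketch below evaluates this
standard formula for the three test lengths in the study. It is an
illustration, not a description of the OFCOMM output, and the function
name is hypothetical. The values apply only when no items have been
removed before the analysis.

```python
def factor_model_df(n_items, n_factors=1):
    """Degrees of freedom for the fit test of an exploratory factor model."""
    return ((n_items - n_factors) ** 2 - (n_items + n_factors)) // 2

# One factor model for the 10, 20, and 30 item tests.
df_by_length = {p: factor_model_df(p) for p in (10, 20, 30)}
```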


CHAPTER  IV 
RESULTS 

The  purpose   of  the  simulation   was  to    investigate  the 
effects  of  test  length,  number  of  examinees,  distribution  of 
the latent trait, range of item difficulty, and item
response  model    on  two  indices  of    dimensionality.  More 
specifically,    the  purpose  was  to  investigate  whether  these 
factors    affected  the   degree    to   which  these  indices 
inappropriately  indicated   multidimensionality.      The  two 
indices  were  the    number  of  eigenvalues  of    the  tetrachoric 
correlation  matrix   greater  than   one  and    the  chi-square 
goodness  of  fit  test  for  a  one  factor  model.     The  design  of 
the simulation was a 3 (test length) X 3 (number of examinees)
X 10 (distribution of the latent trait) X 2 (item difficulty
range) X 3 (item response model) design. For each cell of
the  design  the  data  were  replicated  20  times. 
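The size of the design follows directly from these factor levels. The
sketch below simply enumerates the cells; the variable names are
hypothetical, and the labels used for the levels are shorthand for the
conditions described in Chapter III.

```python
from itertools import product

TEST_LENGTHS = (10, 20, 30)
SAMPLE_SIZES = (100, 500, 1000)
N_DISTRIBUTIONS = 10                       # skewness and kurtosis pairs
DIFFICULTY_RANGES = ((-1, 1), (-2, 2))
MODELS = ("normal ogive", "two parameter", "three parameter")
REPLICATIONS = 20

cells = list(product(TEST_LENGTHS, SAMPLE_SIZES, range(N_DISTRIBUTIONS),
                     DIFFICULTY_RANGES, MODELS))
n_cells = len(cells)                                   # 540 cells
n_datasets = n_cells * REPLICATIONS                    # 10800 data sets
per_test_length = n_datasets // len(TEST_LENGTHS)      # 3600 per level
per_difficulty_range = n_datasets // len(DIFFICULTY_RANGES)   # 5400 per level
# Replications behind each model by sample size by test length combination.
per_model_size_length = N_DISTRIBUTIONS * len(DIFFICULTY_RANGES) * REPLICATIONS
```

These totals agree with the frequency tables reported later in the
chapter: each criterion is classified 3600 times per test length,
sample size, or model, and 5400 times per difficulty range.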

The  proportion   of  replications  in  which   the  eigenvalue 
criterion  indicated    unidimensionality  was    calculated  for 
each combination of the 5 preceding factors. Figures 4-1 to
4-10 are plots of these proportions against the ratio of items
to  sample  size  for  each  of    the  combinations  of  skewness  and 
kurtosis.     The  plots  refer  to  the  three  parameter  logistic 
model  and  the  wider  range  of  difficulty. 

Inspection of the plots indicates that the distribution
of the latent trait had little or no effect on the tendency
of the eigenvalue criterion to indicate multidimensionality.
The plots for other model-difficulty range combinations and
for  the    chi-square  criterion  also    indicated  the    lack  of 
effect   of    the    distribution    of    the    latent  trait. 
Consequently,  these  plots  were  not  reported. 

Figures 4-11 to 4-22 are also graphs of the proportion of
unidimensional replications plotted against the ratio of
items to sample size. The plots refer to the normally
distributed samples. Figures 4-11 and 4-12 depict the results
for the eigenvalue criterion and the normal ogive model for
the narrower and wider ranges of item difficulty
respectively. These two graphs indicate that for the normal
ogive model the range of difficulty had little or no effect
on the eigenvalue criterion. The graphs also indicate that
with 10 items and either 500 or 1000 examinees, the
eigenvalue criterion was a very effective indicator of
unidimensionality. For these item-sample size combinations
the eigenvalue criterion indicated a single trait for all 20
replications. The eigenvalue criterion also worked well
with 20 items and either 500 or 1000 examinees. With 30
items the eigenvalue criterion consistently indicated too
many dimensions for all the sample sizes. With 100
examinees the criterion worked well only for 10 items,
indicating a single dimension in 70 percent of the
replications in the narrow difficulty range, and in 92
percent for the wider difficulty range.

Figures 4-13 and 4-14 depict the results for the chi-square
criterion and the normal ogive model for the narrower and
wider ranges of item difficulty respectively. These two
graphs indicate that for the normal ogive model the range of
difficulty had little or no effect on the chi-square
criterion. The graphs also indicate that with 1000
examinees and 10, 20, or 30 items or with 500 examinees and
10 or 20 items the chi-square criterion was a very effective
indicator of unidimensionality. For the normal ogive model,
the chi-square criterion worked much better in assessing
dimensionality than the eigenvalue criterion. The only case
where the eigenvalue criterion proved more effective with
this model was for the 10 item and 100 examinee group. It
appears that the eigenvalue criterion may be more sensitive
to the number of items than the chi-square criterion.

Figures 4-15 and 4-16 depict the results for the eigenvalue
criterion and the two parameter logistic model for the
narrower and wider ranges of difficulty respectively. These
two graphs show that for the two parameter model the range
of difficulty had a very definite effect on the eigenvalue
criterion. With the broader range of difficulty the item to
sample size ratio had a much smaller impact on
dimensionality assessment. In fact, dimensionality
assessment was better for this condition than any of the
other conditions examined in this study. Only the condition
with 30 items and 100 subjects showed any incorrect
assessments. For the narrower range the eigenvalue
criterion performed poorly with the 100 subject samples and
with the 30 item-500 subject combination.

Figures 4-17 and 4-18 depict the results for the chi-square
criterion and the two parameter logistic model for the
narrower and wider ranges of difficulty respectively. These
two graphs show that the range of difficulty strongly
affected the assessment of dimensionality. The chi-square
criterion performed very well with the narrow range of
difficulty. However, in the wide range of difficulty it
performed well only on the 10 item conditions, and performed
best with the 1000 or 500 examinee conditions. For the
narrow range the only conditions in which the chi-square
criterion performed poorly were the 30 item tests with 100 or
500 examinee samples. For the two parameter logistic model,
the eigenvalue criterion worked best under wider ranges of
difficulty, while the reverse was true for the chi-square
criterion.

Figures 4-19 and 4-20 depict the results for the eigenvalue
criterion and the three parameter logistic model for the
narrower and wider ranges of difficulty respectively. These
two graphs show that for the three parameter model the range
of difficulty had little effect on the eigenvalue criterion.
The eigenvalue criterion performed poorly for all but the 10
item-1000 examinee combination in both difficulty ranges and
the 10 item with 500 examinee combination in the wider
difficulty range. The slightly better performance of the
eigenvalue criterion under the wider range of difficulty was
consistent with all three models.

Figures 4-21 and 4-22 depict the results for the chi-square
criterion and the three parameter logistic model for the
narrower and wider ranges of difficulty respectively. These
two graphs show that for the three parameter model the range
of difficulty had little effect on the chi-square criterion.
For both difficulty ranges the criterion performed well for
all but the combinations of 100 examinees with 20 or 30
items. As with the other models, the criterion performed
slightly better under narrower ranges of difficulty. With
data generated by the three parameter model, the chi-square
criterion performed much better than the eigenvalue
criterion.

To summarize the effects of the factors several tables
were prepared. Table 2 shows the effect of number of items
on the ability of the criteria to indicate
unidimensionality. The eigenvalues from the tetrachoric
correlation matrix provided the eigenvalues greater than one
criterion. If the second eigenvalue was greater than one,
the data were classified as multidimensional; if less than
one, the data were classified as unidimensional. With the
chi-square goodness of fit criterion, the data were
considered unidimensional if a one factor model fit at the
.05 level. Otherwise the data were classified as
multidimensional. Both procedures work best with ten items.
In particular the effectiveness of the criteria is quite
poor when there are thirty items.
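The eigenvalue rule just described can be sketched in code. The
example below is an illustration only: the study used IMSL routines,
whereas this sketch computes the eigenvalues of a symmetric matrix
with a plain cyclic Jacobi iteration, and the function names and the
two small example matrices are hypothetical.

```python
import math

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that sets the (p, q) element to zero.
                if diff == 0.0:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):            # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):            # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def eigenvalue_criterion(corr):
    """Classify by whether the second largest eigenvalue exceeds one."""
    second = symmetric_eigenvalues(corr)[1]
    return "multidimensional" if second > 1.0 else "unidimensional"

# Three items sharing one factor versus two unrelated pairs of items.
one_factor = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
two_blocks = [[1.0, 0.8, 0.0, 0.0], [0.8, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.8], [0.0, 0.0, 0.8, 1.0]]
labels = (eigenvalue_criterion(one_factor), eigenvalue_criterion(two_blocks))
```

The text does not say how a second eigenvalue exactly equal to one was
treated; the sketch arbitrarily counts it as unidimensional.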

Table  3  shows     the  effect  of  number  of     examinees  on  the 
ability of the criteria to indicate unidimensionality. The
chi-square  criterion    works  quite  well     with  500     or  1000 
examinees.       The  eigenvalue    criterion  generally  performed 
less  well  than  the  chi-square  criterion. 
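The frequencies in Table 3 are easier to compare as proportions. Each
sample size accounts for 3600 classifications per criterion (the two
rows for a criterion sum to 3600 in every column), so the sketch below
divides the unidimensional counts by that total. The variable names
are hypothetical; the counts are those reported in Table 3.

```python
# Replications classified as unidimensional, by criterion and sample size.
unidimensional = {
    "chi-square": {100: 853, 500: 3132, 1000: 3258},
    "eigenvalue": {100: 1007, 500: 2176, 1000: 2483},
}
PER_SAMPLE_SIZE = 3600    # e.g. 853 + 2747 for chi-square with 100 examinees

proportions = {
    criterion: {n: count / PER_SAMPLE_SIZE for n, count in by_size.items()}
    for criterion, by_size in unidimensional.items()
}
```

On these figures the chi-square criterion indicates one factor in
roughly 87 and 90 percent of replications with 500 and 1000 examinees,
against roughly 60 and 69 percent for the eigenvalue criterion.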


TABLE 2

Frequency of Unidimensional and Multidimensional
Replications for Three Test Lengths

              Number          Number of items
              of
Criterion     Factors      ten    twenty    thirty

Chi-square      1         3085      2202      1956
               >1          515      1398      1644

Eigenvalue      1         2908      1926       832
               >1          692      1674      2768

TABLE 3

Frequency of Unidimensional and Multidimensional
Replications for Three Sample Sizes

              Number         Number of subjects
              of
Criterion     Factors      100       500      1000

Chi-square      1          853      3132      3258
               >1         2747       468       342

Eigenvalue      1         1007      2176      2483
               >1         2593      1424      1117

To display the effect of difficulty range on
unidimensionality assessment, frequency tables were
generated for the two ranges of difficulty relative to the
dimensionality of the sample for each criterion. These are
shown in Table 4. The narrow difficulty range produced
better performance for the chi-square criterion relative to
the eigenvalue criterion. For the wide range of difficulty
the eigenvalue criterion performed slightly better.


TABLE 4

Frequency of Unidimensional and Multidimensional
Replications for Two Difficulty Ranges

              Number        Difficulty range
              of
Criterion     Factors     -1 to 1     -2 to 2

Chi-square      1           4114        3129
               >1           1286        2271

Eigenvalue      1           2272        3394
               >1           3128        2006

The effect of the different models is summarized in Table
5, which shows the frequency of unidimensional and
multidimensional samples for each of the three models. The
chi-square criterion performed best with data generated by
using the normal ogive model and the three parameter
logistic model. The eigenvalue criterion performed best
with data generated by using the two parameter logistic
model. The eigenvalue criterion performed worst with data
generated by using the three parameter logistic model.
However, the chi-square criterion performed best with these
data.


TABLE 5

Frequency of Unidimensional and Multidimensional
Replications for Three Models

              Number                   Model
              of          Normal      Two          Three
Criterion     Factors     Ogive       Parameter    Parameter

Chi-square      1          2374        1942         2927
               >1          1226        1658          673

Eigenvalue      1          1884        2885          897
               >1          1716         715         2703

The preceding tables ignore the interaction between
models, number of examinees, and test length. The
relationships of model, test length, and sample size with
the proportion of replications that were unidimensional are
displayed in Table 6 for the eigenvalue criterion. The
results clearly indicate the interactive nature of the
effects. With ten items and data generated by using either
the normal ogive or the two parameter model, the eigenvalue
criterion worked at least reasonably well for all sample
sizes. With data generated by using the three parameter
model, it did not work at all well with 100 examinees. With
20 items and data generated by using the normal ogive model
or the two parameter logistic model, the criterion worked
well only with sample sizes of 500 and 1000. With thirty
items the criterion worked poorly except with 1000 examinees
and data generated by using the two parameter model.

Table  7  shows  the  effects    of  model,    test  length,  and 
sample  size  combinations  on    the  proportion  of  replications 
that  were  unidimensional  by  the  chi-square  criterion.  Again 
the  results  indicate  the  interactive    nature  of  the  effects. 
With 10 items the chi-square criterion performed poorly only
with  data  generated    using  the  normal  ogive    model  for  100 
examinees.     With  all  test  lengths,    and  either  500  or  1000 
examinees,    the  criterion  worked  well  with  data  generated  by 
using  the  normal  ogive  or  the  three  parameter  model. 

TABLE 6

Proportion of Replications That Were
Unidimensional: Eigenvalue Criterion

                                        Test length
Model               Sample size      10       20       30

Normal Ogive            100          .81      .00      .00
                        500         1.00      .90      .06
                       1000         1.00      .95      .00

Two Parameter           100          .74      .50      .37
Logistic                500         1.00      .96      .65
                       1000         1.00     1.00      .99

Three Parameter         100          .10      .00      .00
Logistic                500          .67      .20      .00
                       1000          .95      .31      .00

TABLE 7

Proportion of Replications That Were
Unidimensional: Chi-square Criterion

                                        Test length
Model               Sample size      10       20       30

Normal Ogive            100          .04      .00      .00
                        500         1.00     1.00      .89
                       1000         1.00     1.00     1.00

Two Parameter           100          .76      .00      .00
Logistic                500          .94      .50      .50
                       1000          .96      .68      .50

Three Parameter         100          .99      .32      .00
Logistic                500         1.00     1.00     1.00
                       1000          .99     1.00     1.00


CHAPTER  V 
DISCUSSION 

One major objective of the study was to assess the impact
of latent trait non-normality on the ability of factor
analysis to indicate a single trait when the data fit a
normal ogive model. Surprisingly there was a complete lack
of effect of skewness and kurtosis on the assessment of
unidimensionality. Lord (1980) had hypothesized that
departures from normality of the latent trait should lead to
spurious factors even when there is a single latent trait.
Hattie (1981) pointed out that tetrachoric correlations have
the assumption of normality, making spurious factors likely
when the assumption is not met. A second objective was to
assess the impact of latent trait non-normality when the
data were generated by using either a two or three parameter
model. Again non-normality had little or no effect. It
appears from this simulation that the criteria used here
are insensitive to even substantial departures from
normality.

A third objective was to investigate the impact of test
length and sample size on the effectiveness of the
eigenvalue and chi-square criteria as indicators of
unidimensionality. The effects on the assessment of
unidimensionality due to the number of items and the number
of subjects were generally consistent with the literature
(Linn, 1968). As can be seen in Table 2 and the graphs in
Figures 4-11 to 4-22, increasing the number of items seems to
increase the likelihood of finding a spurious second factor
in the simulation. It seems likely that spurious factors
will also be found in real data with many items. Mulaik and
McDonald (1978) suggested that more items may better define
the factor space, but had many cautions in terms of the
definition of that domain. In contrast, Linn (1968) found
more stable solutions when the proportion of number of items
to factors would better assess unidimensionality, with fewer
items producing the best results. Lord and Novick (1968) also
suggested that fewer items would help in the assessment of
unidimensionality.

The fewer the number of subjects, the more likely random
factors would be found. For 500 or 1000 examinees the
chi-square criterion seems more effective than the eigenvalue
criterion (Table 3). Linn (1968) found more accurate factor
representation from samples with larger numbers of subjects.
Browne (1968) also found that large samples were needed for
accurate factor representation.

The proportion of items to subjects did not seem to be a
very good index to predict the accuracy of the results.
This can be seen in Table 6 and Table 7, where the 1000
subject and 20 item samples have the same ratio of items to
subjects as the 500 subject and 10 item samples, but
assessment of unidimensionality was better with shorter
tests. This effect can also be seen in Figures 4-11 to 4-22,
where the lines do not descend in a smooth fashion.

It is important to note that test length interacted with
sample size, test model, and criterion. For example, with
all test lengths and with either 500 or 1000 examinees, the
chi-square criterion worked well with data generated using
the normal ogive or three parameter model. With data
generated using the two parameter model, the chi-square
criterion worked poorly for 20 and 30 item tests,
irrespective of sample size, but worked well with 10
items and all sample sizes. With 500 or 1000 examinees and
tests of 10 or 20 items, the eigenvalue criterion worked
well with data generated by using the normal ogive or the
two parameter model. It also worked reasonably well in four
other conditions: two parameter model, 30 items, 500 or 1000
examinees; three parameter model with 10 items, 500 or 1000
examinees.

The range of difficulty had a different effect, depending
on the criterion for unidimensionality (see Table 4). The
eigenvalue    criterion    showed  a    better    indication  of 
unidimensionality  with  the  larger  range  of  difficulty.  The 
converse  was  true  of  the  chi-square  criterion. 

The effects due to the models were complex and typically
difficult to explain. Table 6 showed that the eigenvalue
criterion performed best with data generated by using the
two parameter model, second best with data generated by
using the normal ogive model, and worst with data generated
using the three parameter model. The latter result is
sensible because the non-zero pseudo-guessing parameter
probably attenuates the inter-item correlations. This in
turn increases the size of the second eigenvalue. The first
two results are the more difficult to explain. The normal
ogive and the two parameter logistic model are
quantitatively and qualitatively quite similar. As a result
it is reasonable to expect the eigenvalue criterion to
perform quite similarly with data generated by using either
model. Table 7 showed that the chi-square criterion worked
best with data generated by using the three parameter model,
second best with data generated using the normal ogive
model, and worst with data generated by using the two
parameter model. The difference in performance observed in
connection with the latter two models is again difficult to
explain since these two models are quite similar. The two
parameter model is actually a special case of the three
parameter model, obtained by setting the pseudo-guessing
parameter to zero. It is unclear why this would have a
deleterious effect on the assessment of unidimensionality.

Based on the preceding results, linear factor analysis can
be recommended confidently for assessing unidimensionality
with tests of 10 items or fewer administered to 500 or more
examinees. Under these conditions both criteria were
reasonably good indicators of dimensionality, even as the
model for generating the data was varied. For these
conditions the chi-square criterion was somewhat more
effective than the eigenvalue criterion. However, even this
recommendation cannot be made without reservation. The
present simulation did not investigate dimensionality
assessment for data known to be generated from a
multidimensional latent trait. Thus we cannot be sure that
the two criteria are sensitive to multidimensional latent
traits. This is an area in which additional research is
required.

The results also give a measure of support to the
recommendation to use the chi-square criterion to assess
unidimensionality with multiple choice test data. With data
generated using the three parameter model, assessment of
unidimensionality using the chi-square criterion was good
with 10 items and all sample sizes, and with 20 or 30 items
and sample sizes of 500 or 1000 examinees. The three
parameter model allows for the possibility that even
students with very low latent trait scores have a non-zero
probability of answering an item correctly. This would
occur if low ability examinees tend to attempt to guess the
correct answers. Since guessing seems likely on most
multiple choice examinations, a priori one might expect that
the three parameter model is likely to be more consistent
than the other models with multiple choice test data. Thus
the results support the recommendation to use the chi-square
criterion to assess unidimensionality with multiple choice
tests.

However, the preceding recommendations must be tempered by
the finding that the chi-square criterion worked less well
with data generated by using the two parameter model than it
did with data generated by using the three parameter model.
Recalling that with the three parameter model the
pseudo-guessing parameter ranged from .00 to .30, the poor
performance by the chi-square criterion with data generated
by using the two parameter model implies that it may also
perform poorly if the pseudo-guessing parameter is
distributed in a narrower range than .00 to .30 or is not
distributed uniformly in the range .00 to .30. Additional
research is required to establish the limits of the
generalizability of the recommendations.

The generalizability of all of the results is limited by
the fact that only one range and distribution was used for
the discrimination parameter, and only one distribution was
used for the difficulty parameter. Additional research is
required to investigate the impact of these factors on
dimensionality assessment. In addition there are other
criteria for unidimensionality and other indices of
inter-item relationships that should be investigated in
future research.